fix(pool): probe liveness before reusing session-pinned WS upstream #3021
thiscantbeserious wants to merge 1 commit into maximhq:main
…q#3002)
Pool.Get now performs a 1ms read-deadline probe on each idle connection before handing it out. A timeout error (no queued data) confirms the connection is alive. A close frame, EOF, or any other error reveals that the upstream has closed the connection server-side; the stale entry is discarded and a fresh dial is performed.
This closes the window where a session-pinned upstream connection appeared live (IsClosed() == false, age within limits) but carried a buffered close frame from a previous response, causing the next request to fail immediately on first read.
Adds TestPoolGetEvictsStaleSessionConn: mock WS server that closes the connection after delivering its response; asserts Pool.Get dials fresh.
Closes maximhq#3002
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ure-branch duplicate)
The liveness probe for stale session-pinned WS pool connections (maximhq#3002) was implemented directly on fix/3002-pool-stale-session-conn off main. The feature branch never contained a parallel implementation of this probe, so there is no code to remove. This commit marks the extraction as complete and links the focused fix to this branch for traceability.
Related: maximhq#3021, maximhq#3002
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Superseded by #3018 (merged 2026-04-24), which bundles the fix for this issue and several other native WS reliability bugs. Closing this draft as redundant.
Summary
After the upstream WebSocket server sends its terminal event and then closes the connection (e.g. with code 1011 on keepalive timeout), Bifrost would hand out the same connection to the next request on the same client session. The connection appeared live (IsClosed() returned false, age was within limits) because the WS close frame sat buffered in the socket and had never been read. The next caller would succeed on WriteMessage (OS buffer accepts the write) and then immediately fail on the first ReadMessage when the buffered close frame was returned as an error. This caused a fallback to the HTTP bridge and an error frame to the client.
The root cause is that Pool.Get relied only on an atomic bool (set by Bifrost's own Close call) and wall-clock expiry checks to determine connection health. Neither of those detects a server-side close that Bifrost never consumed.
Changes
transports/bifrost-http/websocket/pool.go: added (*Pool).isLive helper that sets a 1ms read deadline on an idle connection and calls ReadMessage. A timeout error (no queued data) confirms the connection is alive. A close frame, EOF, or any other error reveals it is stale. The probe runs in Pool.Get after the existing IsClosed / isExpired checks, before incrementing inFlight and returning the connection to the caller. Stale connections are closed and the loop continues to the next idle candidate or a fresh dial.
transports/bifrost-http/websocket/pool_test.go: added TestPoolGetEvictsStaleSessionConn. A mock WS server accepts one connection, then closes it server-side after the test returns the connection to the pool. Pool.Get is called again and must produce a fresh connection (new dial), not the stale one.
Type of change
Affected areas
How to test
Expected: TestPoolGetEvictsStaleSessionConn passes, all existing tests pass, no vet warnings.
Integration test (manual): configure a WS upstream that closes connections after the terminal event. Send two sequential requests on the same Bifrost WS session with a pause between them. Before this fix, request 2 would receive an error frame. After this fix, request 2 succeeds with a fresh upstream connection.
Breaking changes
Related issues
Closes #3002
This fix was discovered while working on PR #2775 (openai-oauth-passthrough). The feature branch in that PR contains an earlier version of the liveness probe, which has been extracted here and removed from the feature branch in a separate commit.
Security considerations
None. The probe is a local read with a 1ms deadline on a connection that Bifrost owns exclusively while it is in the idle pool. No external input is involved and no credentials are affected.
Checklist
docs/contributing/README.md and followed the guidelines